Skip to content

Add safe read-write lock#236

Open
Shourya742 wants to merge 2 commits intostratum-mining:mainfrom
Shourya742:2026-02-03-add-fast-read-write-lock-implementation
Open

Add safe read-write lock#236
Shourya742 wants to merge 2 commits intostratum-mining:mainfrom
Shourya742:2026-02-03-add-fast-read-write-lock-implementation

Conversation

@Shourya742
Copy link
Collaborator

@Shourya742 Shourya742 commented Feb 4, 2026

closes: #255

@Shourya742 Shourya742 force-pushed the 2026-02-03-add-fast-read-write-lock-implementation branch from 7958eb3 to af042d1 Compare February 4, 2026 12:39
@plebhash
Copy link
Member

plebhash commented Feb 5, 2026

this PR is introducing parking_lot as a dependency to stratum-apps, which is the foundational crate for the application layer of the Stratum V2 Reference Implementation codebase

parking_lot crate repo is: https://github.com/Amanieu/parking_lot

relevant statistics on the repo:

  • 556k "users" on github
  • "recent" commits (~2months ago)
  • was bootstrapped around ~10y ago

these numbers indicate it could potentially remain well maintained for mid-long term, but I'll try tagging @Amanieu here to hear his plans for it

@@ -0,0 +1,20 @@
use parking_lot::RwLock as InnerRwLock;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

parking_lot has interesting potential

but we should evaluate those trade-offs later

the proposal here is to add a variant to custom_mutex.rs which uses std::sync::Mutex, so we should stick with the std::sync::RwLock to reduce review complexity here

@Shourya742 Shourya742 force-pushed the 2026-02-03-add-fast-read-write-lock-implementation branch from 6dda40e to 1eb2abe Compare February 9, 2026 05:43
@Shourya742 Shourya742 marked this pull request as ready for review February 9, 2026 05:43
@Shourya742
Copy link
Collaborator Author

Shourya742 commented Feb 9, 2026

On the apps side, we are moving from a single lock guarding a wide structure with multiple embedded types to a micro-locking model, where each type owns its own lock. During the DashMap migration and the trimming of tproxy
(see: #228), we observed that the vast majority of access patterns are read-heavy, with relatively few writes. This prompted us to re-evaluate the custom mutex implementation and motivated the introduction of a custom RwLock, which is a better fit for these usage patterns.

During last week’s Vintuem call, we also discussed that introducing parking_lot at this stage would significantly increase the review surface of this PR, since it would require a deeper audit of locking behavior and failure modes. For now, this PR has been updated to use the standard library locks (commit 1eb2abe). A migration to parking_lot can be revisited later, once we have concrete numbers to justify the change.

Implementation nuances

As we move toward micro-locks per structure, lock ergonomics become more challenging. Relying solely on closure-based “safe” APIs quickly leads to deeply nested callbacks, which hurts readability and maintainability:

a.safe_read(|x| {
    b.safe_write(|x| {
        c.safe_read(|x| {
        })
    })
})

To avoid this, the custom implementation provides both closure-based safe read/write APIs and explicit read/write guard APIs. These guards are !Send, which makes them unusable across .await points by construction. As a result, releasing the lock becomes an explicit and deliberate action by the caller. This approach preserves the safety guarantees of the closure-based APIs while avoiding callback hell and making lock lifetimes more visible. If a guard is not explicitly released, the implementation will panic, making incorrect usage fail fast rather than silently causing deadlocks or prolonged lock retention. You can look at the test: 740bca8 to understand how it works better.

Poisoning semantics

One important distinction to highlight is lock poisoning, where the standard library and parking_lot differ in behavior.

  • parking_lot does not implement poisoning; a lock is always considered acquirable, regardless of whether a previous holder panicked.
  • std::sync locks treat poisoning as a policy decision: once a lock is poisoned, future acquisition attempts return an error unless the caller explicitly chooses to recover from the poisoned state.

This difference directly impacts the custom_rw_lock APIs. With a std-backed implementation, the APIs must be poison-aware and propagate poisoning semantics to the caller. With parking_lot, these semantics would not exist. This behavioral difference is a key consideration for any future migration.

@Shourya742 Shourya742 force-pushed the 2026-02-03-add-fast-read-write-lock-implementation branch from 93107f9 to 740bca8 Compare February 9, 2026 12:44
@jbesraa
Copy link
Contributor

jbesraa commented Feb 10, 2026

Out of curiosity, did you guys consider using https://crates.io/crates/arc-swap?

Comment on lines +105 to +114
impl<T: ?Sized> Drop for ReadGuard<'_, T> {
fn drop(&mut self) {
if !self.released.load(Ordering::Acquire) {
panic!(
"ReadGuard dropped without explicit release(); \
this is a bug. Call release() to acknowledge lock lifetime."
);
}
}
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

consider the following scenario.

If this is used across the apps and a panic occurs during message processing before released = true, we trigger a second panic: one from the failing downstream operation, and another from Drop.

Like I did below:

#[test]
fn test_poisoning() {
    let lock = RwLock::new(0);

    let mut guard = lock.write().unwrap();
    *guard = 1;
    *guard = 2;
    panic!("Simulated panic while holding write lock");
}

Because the second panic happens inside Drop, the runtime aborts:

thread 'custom_rw_lock::tests::test_poisoning' panicked at src/custom_rw_lock.rs:250:9:
Simulated panic while holding write lock

thread 'custom_rw_lock::tests::test_poisoning' panicked at src/custom_rw_lock.rs:152:13:
WriteGuard dropped without explicit release(); this is a bug. Call release() to acknowledge lock lifetime.
stack backtrace:
[trimmed]
thread 'custom_rw_lock::tests::test_poisoning' panicked at library/core/src/panicking.rs:226:5:
panic in a destructor during cleanup
thread caused non-unwinding panic. aborting.

Instead of just dropping the faulty client, the entire process aborts.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A panic is always fatal, that’s the rule of thumb so in that sense it’s acceptable. Given how much effort we’ve put into improving error handling, we shouldn’t be hitting panics in normal operation anyway.

That said, this does highlight a design concern. With the use cases we have, the current locking pattern doesn’t fully mimic the control flow flexibility we get with closures. In a closure, we can return early and rely on scope-based cleanup. In these read/write cases, we don’t have that same safety.

For example:

let xyz = channel_manager.read().unwrap();
...
f()?; // if this returns early, we never explicitly release the lock
...

xyz.release();

If an early return happens, we lose the opportunity to release the lock, which could lead to a panic or a poisoned state.

I need to think more about how to structure this so we don’t end up shooting ourselves in the foot.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have just removed the explicit lifetime context altogether, its not possible in rust to gain closure type linear typing during runtime,

@Shourya742 Shourya742 force-pushed the 2026-02-03-add-fast-read-write-lock-implementation branch 2 times, most recently from 4995b82 to 66d6d43 Compare February 17, 2026 14:51
@Shourya742
Copy link
Collaborator Author

Keeping this pr simple closure based only.

@lucasbalieiro
Copy link
Collaborator

Keeping this pr simple closure based only.

ACK, did you removed the unit tests by accident?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

need to add a RwLock variant to stratum_apps::custom_mutex::Mutex

5 participants